These notes are based on Chapter 7.4-7.6 of Kroese, D. P., & Chan, J. C. (2014). Statistical modeling and computation. New York: Springer.
In principle, rejection sampling can work in a high-dimensional problem if we can find a PDF \(g\) from which we can efficiently sample and a constant \(C\) such that \(Cg(x)\) is a tight envelope. It turns out this is a big IF. In 1-D cases, by looking at the graph of the target \(f(x)\), we can often design a reasonable \(Cg(x)\). Even if the envelope is not very tight, we can wait for a while and obtain enough samples from \(\text{U}(0,Cg(x))\) that fall below \(f(x)\). This is unlikely to be the case in high-dimensional problems. Our visual intuition in a 4-D or higher space is poor (at least mine is). In addition, the rejection rate typically grows exponentially with the dimension, and we may not be able to wait for hours or days.
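To see this concretely, here is a small sketch (my own toy example, not from the book): rejection sampling of the uniform distribution on the unit \(d\)-ball, using the uniform distribution on the enclosing cube \([-1,1]^d\) as the envelope. The acceptance rate is the volume ratio, which collapses quickly with \(d\).

```r
# Empirical acceptance rate of rejection sampling: propose uniformly on
# the cube [-1,1]^d and accept points inside the unit d-ball.
set.seed(1)
accept_rate <- function(d, n = 1e5) {
  X <- matrix(runif(n * d, -1, 1), n, d) # n proposals from the cube
  mean(rowSums(X^2) <= 1)                # fraction falling inside the ball
}
sapply(c(2, 5, 10), accept_rate)
# approx. 0.785 (d=2), 0.16 (d=5), 0.0025 (d=10)
```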
Markov chain Monte Carlo (MCMC) is a general class of methods to sequentially generate samples from an approximate distribution that increasingly resembles a target distribution. A sequence of samples are generated by simulating a Markov chain, which is designed to have the limiting distribution equal to the target distribution. By being content with only approximate distributions, MCMC enables us to handle functions on high-dimensional spaces, which are common in modern complex problems.
Since Monte Carlo simply refers to stochastic simulation, we might just call it Markov chain simulation. But, MCMC has sort of become the standard name everybody using statistical computing understands, perhaps because the acronym is just catchy.
A Markov chain is a sequence of random variables \(X_1,X_2\dots\) (i.e., a stochastic process) whose futures are conditionally independent of the past given the present. That is, for all \(i\geq0\), \[(X_{i+1}|X_i,X_{i-1},\dots) \sim (X_{i+1}|X_i) .\]
In an IID sequence (e.g., Bernoulli process), futures are independent even of the present.
If each \(X_i\) is a discrete random variable, \[P(X_{i+1}=x_{i+1}|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}=x_{i+1}|X_i=x_i).\] In particular, if \(x_{i+1}\) is an element of a finite set that is common for all \(i\geq0\), then the Markov chain can be represented by the transition matrix, which we denote by \(Q\). That is, for all \(i\geq0\), \[Q(j,k) = P(X_{i+1}=k|X_i=j) = P(X_1=k|X_0=j),\] where \(j,k\) are elements of the finite set, usually called the state space. For instance, for a chain with six states, if we start with \(X_0=j\), the probability of \(X_2=k\) is \[\begin{align} P(X_2=k|X_0=j) &= \sum_{s=1}^6 P(X_2=k,X_1=s|X_0=j)\\ &= \sum_{s=1}^6 P(X_1=s|X_0=j)\cdot P(X_2=k|X_1=s)\\ &= \sum_{s=1}^6 Q(j,s)\cdot Q(s,k)\\ &= Q^2(j,k). \end{align}\]
To see the last equality, given \(j\) and \(k\), take the dot product of the \(j\)th row and the \(k\)th column of \(Q\); that is exactly the \((j,k)\) element of \(Q^2\).
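A quick numeric check, using a hypothetical 2-state transition matrix (not the example below):

```r
# The (j,k) entry of Q %*% Q equals the dot product of row j and column k of Q.
Qtoy <- matrix(c(0.9, 0.1,
                 0.4, 0.6), 2, 2, byrow = TRUE) # hypothetical 2-state chain
sum(Qtoy[1, ] * Qtoy[, 2]) # dot product of row 1 and column 2: 0.15
(Qtoy %*% Qtoy)[1, 2]      # P(X_2 = 2 | X_0 = 1): also 0.15
```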
As a concrete example, consider a Markov chain with state space \(\{1,2,3,4,5,6\}\) and the following transition matrix.
Q
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0 0.2 0.0 0.3 0.5 0.0
## [2,] 0.5 0.0 0.5 0.0 0.0 0.0
## [3,] 0.3 0.0 0.0 0.7 0.0 0.0
## [4,] 0.0 0.0 0.0 0.0 0.0 1.0
## [5,] 0.0 0.0 0.0 0.8 0.0 0.2
## [6,] 0.0 0.0 0.1 0.0 0.9 0.0
For example, from \(Q^2\), you can read off the probability of \(X_2=5\) given \(X_0=3\) as 0.15.
Q%*%Q
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.10 0.00 0.10 0.40 0.00 0.40
## [2,] 0.15 0.10 0.00 0.50 0.25 0.00
## [3,] 0.00 0.06 0.00 0.09 0.15 0.70
## [4,] 0.00 0.00 0.10 0.00 0.90 0.00
## [5,] 0.00 0.00 0.02 0.00 0.18 0.80
## [6,] 0.03 0.00 0.00 0.79 0.00 0.18
In general, the probability of \(X_n=k\) given \(X_0=j\) is \(Q^n(j,k)\).
For most MCMC applications, we are interested in continuous random variables, i.e., the state space is uncountable. In this case, for \(A\subset\mathbb{R}^d\), \[P(X_{i+1}\in A|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}\in A|X_i=x_i) .\]
In particular, we assume the conditional PDF \(f_{X_{i+1}|X_i}(y|x)\) does not depend on \(i\) and denote it by \(q\). That is, for all \(i\geq0\), \[q(y|x) = f_{X_{i+1}|X_i}(y|x)=f_{X_1|X_0}(y|x),\] where \(x,y\in\mathbb{R}^d\) are states. We call \(q\) the transition density of the (continuous-state) Markov chain.
While in general each of \(X_0,X_1,\dots\) follows a distinct marginal distribution, there is a special common distribution. A state distribution (i.e., marginal distribution) \(f\) is said to be stationary if \(f\) satisfies the following global balance equations: \[f(x_{i+1}) = \int f(x_{i})q(x_{i+1}|x_{i})dx_{i}.\]
To see how special the global balance is, recall that we always have \[f_{i+1}(x_{i+1}) = \int f_{i}(x_{i})q(x_{i+1}|x_{i})dx_{i},\] by marginalisation, i.e., integrating out \(x_i\) from the joint PDF. In general, two marginal PDFs \(f_{i+1}\) and \(f_{i}\) need not be the same. However, if they happen to be identical under the transition \(q\), then \(f_{i+1} = f_{i} = f\) is a stationary distribution in the Markov chain characterised by \(q\).
Keep in mind that this is a set of equations, as it holds for all \(x_{i+1}\). This is easy to see in the finite-state case: \[p_{i+1} = p_{i}Q,\] where \(p_i\) and \(p_{i+1}\) are state distributions (row vectors). That is, each entry of \(p_{i+1}\) is the result of one marginalisation, and they are all computed at once by the matrix multiplication with \(Q\). Naturally, a state distribution \(p\) is stationary if \[p = pQ.\]
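As a concrete continuous-state instance (an illustrative assumption, not from the text): the AR(1) transition \(q(y|x)=\text{N}(\phi x,\,1-\phi^2)\) has stationary distribution \(\text{N}(0,1)\), since \(\phi^2\cdot 1+(1-\phi^2)=1\). A short simulation agrees:

```r
# Simulate an AR(1) chain whose stationary distribution is N(0,1).
set.seed(1)
phi <- 0.9
n <- 1e5
z <- numeric(n)
z[1] <- 5 # deliberately start far from stationarity
for (i in 2:n) z[i] <- rnorm(1, phi * z[i-1], sqrt(1 - phi^2))
c(mean(z[1001:n]), var(z[1001:n])) # approx. (0, 1) after discarding early samples
```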
Given an initial value/state \(x_0\), using \(q\), we can simulate a Markov chain, i.e., sequentially obtain samples \(x_1,x_2,\dots,x_n\). In other words, we get a sample from the following joint PDF: \[f(\tilde{x}_0,\tilde{x}_1,\dots,\tilde{x}_i) = q(\tilde{x}_i|\tilde{x}_{i-1})q(\tilde{x}_{i-1}|\tilde{x}_{i-2}) \cdots q(\tilde{x}_1|\tilde{x}_0)f(\tilde{x}_0).\] The rationale for a Markov chain as a method for sampling from a desired target distribution is:
Samples from the first \(I\) steps (so-called “burn-in” period) are often discarded.
It is easy to approximate the limiting distribution for a finite-state case. For instance, for \(Q\) defined above, the stationary distribution \(p\) that satisfies \[p = pQ\] is, as a column vector, an eigenvector of \(Q^T\) for eigenvalue \(1\) (normalised to sum to one).
p <- Re(eigen(t(Q))$vectors[,1]) # eigenvector for eigenvalue = 1
p <- p/sum(p) # normalisation
p
## [1] 0.011978917 0.002395783 0.035936751 0.283660757 0.318639195 0.347388596
We can see that \(Q^{500}\) has identical rows, each of which is almost equal to the stationary distribution.
n <- 500
Qn <- Q%^%n # matrix power, using %^% from the expm package
Qn
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [2,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [3,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [4,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [5,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [6,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
What does this imply? For example, suppose \(X_0=4\) and \(p_0 = (0,0,0,1,0,0)\). Then,
p0 <- c(0,0,0,1,0,0)
p0%*%Qn
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
In general,
p0 <- runif(6)
p0 <- p0/sum(p0) # random initial state distribution
p0%*%Qn
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
No matter which state we are in at the beginning, we will end up with the same state distribution at time 500 onwards.
Accordingly, an MCMC estimate of this limiting distribution is
n <- 1e6
x <- rep(0, n)
x[1] <- sample(1:6, 1) # random initial state
for (i in 2:n) {
  x[i] <- sample(1:6, 1, prob=Q[x[i-1],]) # MCMC sampling
}
table(x[501:n])/(n-500) # 500 burn-in periods
##
## 1 2 3 4 5 6
## 0.012230115 0.002498249 0.036043022 0.283321661 0.318564282 0.347342671
In contrast, if we sample from the true stationary distribution, an MC estimate is
x <- sample(1:6, n-500, replace=T, prob=p)
table(x)/(n-500)
## x
## 1 2 3 4 5 6
## 0.011893947 0.002381191 0.035879940 0.284266133 0.317587794 0.347990995
The task in MCMC applications is to design a good Markov chain, equivalently, a good transition density \(q\). Key properties of a good \(q\) are:
Let’s discuss the second point, convergence; we will come back to the other two later.
Ergodicity is the condition for a Markov chain to converge to a unique stationary distribution irrespective of \(x_0\). It consists of two intuitive properties: irreducibility and aperiodicity.
Irreducibility ensures that every state \(x\) can be reached from every other state within a finite number of steps. This is important because, in MCMC, the samples are generated through dependent sampling according to the transition density \(q\), and some \(x_{i+1}\) may not be reachable directly from the current \(x_i\), i.e., \(q(x_{i+1}|x_i)=0\).
Aperiodicity prevents indefinite oscillation and allows the chain to converge to a stationary distribution.
See Roberts & Rosenthal (2004) for more details.
Among the ergodic Markov chains, we could in principle choose a transition density \(q\) that satisfies the global balance equations in order to ensure that the chain converges to our target \(f\). In practice, however, this is tricky because of the integral in the definition of the global balance equations. So, we typically impose a stronger structure on \(q\), called the detailed balance equations: \[f(x)q(y|x) = f(y)q(x|y), \;\forall x,y.\]
By integrating both sides over \(x\) (or \(y\)), we see the detailed balance equations imply the global balance equations.
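A finite-state sketch (with a made-up chain) of how detailed balance arises: any chain of the form \(Q(j,k)=W_{jk}/\sum_l W_{jl}\) with a symmetric weight matrix \(W\) satisfies detailed balance with \(p_j\propto\sum_l W_{jl}\), and global balance then follows.

```r
# A reversible chain built from symmetric weights W (made-up for illustration).
W <- matrix(c(0, 2, 1,
              2, 0, 3,
              1, 3, 0), 3, 3, byrow = TRUE)
Qw <- W / rowSums(W)       # transition matrix
pw <- rowSums(W) / sum(W)  # candidate stationary distribution
M <- diag(pw) %*% Qw       # M[j,k] = p(j) * Q(j,k)
all.equal(M, t(M))               # detailed balance <=> M symmetric: TRUE
all.equal(drop(pw %*% Qw), pw)   # global balance p = pQ follows: TRUE
```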
Here is an example of MCMC for sampling from a bivariate normal distribution \(\text{N}(\mu,\Sigma)\) with \[ \mu=0 \quad\text{and}\quad \Sigma=\begin{bmatrix}1 & 0.7\\0.7 & 1\\\end{bmatrix} .\]
Remember that the joint PDF looks like this:
Using a Gibbs sampler, the result may look like the following.
set.seed(123)
n <- 300
m <- 3
r <- 0.7 # correlation in Sigma
# Initialisation
x <- -1
xx <- rep(x, n)
y <- 2
yy <- rep(y, n)
# Run
for (i in 2:n) {
  x <- rnorm(1, mean=r*y, sd=sqrt(1-r^2)) # sample x given y
  xx[i] <- x
  y <- rnorm(1, mean=r*x, sd=sqrt(1-r^2)) # sample y given x
  yy[i] <- y
  par(mar=c(5,5,.1,.1))
  plot(xx[1], yy[1], xlab="x", ylab="y", xlim=c(-3,3), ylim=c(-3,3), pch=4)
  if (i <= m) {
    segments(xx[-i],yy[-i],xx[-1],yy[-1],lwd=.5)
  } else {
    points(xx[2:(i-(m-1))], yy[2:(i-(m-1))], pch=20, cex=.4)
    segments(xx[(i-m):(i-2)], yy[(i-m):(i-2)],
             xx[(i-(m-1)):(i-1)], yy[(i-(m-1)):(i-1)], lwd=.5)
  }
  arrows(xx[i-1],yy[i-1],xx[i],yy[i], length=0.08, angle=20, lwd=1)
}

Pay attention to the dependency of each new point on the previous one; each arrow only goes so far. The points are not IID samples from the target distribution. But, collectively, they are distributed as if each were sampled from the target distribution.
Another example, consisting of two clusters, gives an even clearer picture of the dependency. The two clusters are \(\text{N}(\mu_0,\Sigma)\) and \(\text{N}(\mu_1,\Sigma)\), where \[ \mu_0=\begin{bmatrix}-1.7\\-1.7\end{bmatrix},\quad \mu_1=\begin{bmatrix}1.7\\1.7\end{bmatrix}, \quad\text{and}\quad \Sigma=\begin{bmatrix}1 & 0\\0 & 1\\\end{bmatrix},\] and the cluster assignment \(C\) follows \(\text{Bernoulli}(0.5)\). In other words, the joint PDF is \[f(x_1,x_2,c) = \frac{1}{2}\cdot\frac{1}{2\pi}\exp\left(-\frac{1}{2}(x-\mu_c)^T(x-\mu_c)\right),\] where \(x=(x_1,x_2)^T\).
set.seed(123)
n <- 300
m <- 3
burn_in <- 5000
C <- c(-1.7,1.7)
# Component density at (x,y) for cluster mean mu (constants cancel in the ratio below)
f <- function(x, y, mu) exp(-((x-mu)^2 + (y-mu)^2)/2)
# Initialisation
x <- 0
y <- 0
c <- C[1]
# Run for the burn-in
for (i in 2:burn_in) {
  x <- rnorm(1, c) # sample x given c
  y <- rnorm(1, c) # sample y given c
  k <- f(x,y,C[1]) + f(x,y,C[2])
  c <- sample(c(C[1],C[2]), size=1, prob=c(f(x,y,C[1])/k, f(x,y,C[2])/k)) # sample c given (x,y)
}
# Run for the plot
xx <- rep(x,n)
yy <- rep(y,n)
cc <- rep(c,n)
for (i in 2:n) {
  x <- rnorm(1, c)
  xx[i] <- x
  y <- rnorm(1, c)
  yy[i] <- y
  k <- f(x,y,C[1]) + f(x,y,C[2])
  c <- sample(c(C[1],C[2]), size=1, prob=c(f(x,y,C[1])/k, f(x,y,C[2])/k))
  cc[i] <- c
  par(mar=c(5,5,.1,.1))
  plot(xx[1], yy[1], xlab="x", ylab="y", xlim=c(-4,4), ylim=c(-4,4),
       pch=20, cex=.4, col=cc[1]/C[2]+3)
  if (i <= m) {
    segments(xx[-i],yy[-i],xx[-1],yy[-1],lwd=.5)
  } else {
    points(xx[2:(i-(m-1))], yy[2:(i-(m-1))], pch=20, cex=.4, col=cc[2:(i-(m-1))]/C[2]+3)
    segments(xx[(i-m):(i-2)], yy[(i-m):(i-2)],
             xx[(i-(m-1)):(i-1)], yy[(i-(m-1)):(i-1)], lwd=.5)
  }
  arrows(xx[i-1],yy[i-1],xx[i],yy[i], length=0.08, angle=20, lwd=1)
}

As you can see, most samples are followed by samples in the same cluster. If the samples were independent, the arrows would jump between the clusters almost every other time.
Since convergence takes more time in this example than in the first one, the first 5000 samples are discarded as a burn-in period.
A takeaway message at this point is that, by running MCMC for a long time, we can collect samples as if they were drawn from the target PDF. Since MC estimates often require many samples anyway, running for a long time may not seem like an issue. However, it may be too long, collecting far more samples than required for a good MC estimate. In practice, we prefer 1,000 good samples to 1,000,000 good samples when 1,000 good samples are good enough.
In practice, most of us do not design a transition density \(q\) from scratch; rather, we use proven design templates. The Metropolis-Hastings algorithm provides a good template. Suppose we have an ergodic Markov chain characterised by a transition density \(q\). The Metropolis-Hastings algorithm tells us how to build upon this Markov chain and design another one that satisfies the detailed balance equations with respect to the target \(f\).
The ergodicity of the new chain is implied by the ergodicity of the original Markov chain with \(q\).
The idea is similar to that of rejection sampling.
The acceptance probability is defined as follows: \[\alpha(x,y) = \min\left(1,\; \frac{f(y)q(x|y)}{f(x)q(y|x)}\right).\]
In the special case of a symmetric transition density, i.e., for all \(x,y\), \[q(y|x) = q(x|y),\] it is referred to as the Metropolis algorithm, with the acceptance probability \[\alpha(x,y) = \min\left(1,\; \frac{f(y)}{f(x)}\right).\] A naïve yet widely used choice in practice is a normal distribution centred at \(x\): \[q(y|x) \propto \exp\left(-\frac{1}{2\sigma^2}(y-x)^T(y-x)\right),\] where \(\sigma^2\) is the common variance across all the components of \(y\); hence the transition density is symmetric.
An intuition is that, when moving to a state of higher probability/density, we always accept the proposal \(y\). Otherwise, we make a probabilistic decision according to the ratio \(f(y)/f(x)\).
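Here is a minimal random-walk Metropolis sketch along these lines (my own toy example, not from the book): the target is the standard normal, known only up to a constant, and the proposal is \(\text{N}(x,\sigma^2)\).

```r
# Random-walk Metropolis for a target known up to a normalising constant.
set.seed(1)
f_unnorm <- function(x) exp(-x^2/2) # unnormalised N(0,1) density
sigma <- 1                          # proposal scale (a tuning choice)
n <- 1e5
chain <- numeric(n)
for (i in 2:n) {
  y <- rnorm(1, chain[i-1], sigma)                     # propose y ~ q(.|x)
  alpha <- min(1, f_unnorm(y) / f_unnorm(chain[i-1]))  # acceptance probability
  chain[i] <- if (runif(1) < alpha) y else chain[i-1]  # accept, or stay put
}
c(mean(chain), var(chain)) # approx. (0, 1)
```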
With the acceptance probability defined above, the new transition density for \(x_{i+1}\neq x_i\) is \[\tilde{q}(x_{i+1}|x_i) = q(x_{i+1}|x_i)\alpha(x_i,x_{i+1}) .\] Less importantly, the probability of not moving is \[P(X_{i+1}=x_i|X_i=x_i) = 1 - \int_{y\neq x_i} q(y|x_i)\alpha(x_i,y)dy .\]
Let’s make sure that \(\tilde{q}\) satisfies the detailed balance equations: \[f(x)\tilde{q}(y|x) = f(y)\tilde{q}(x|y), \;\forall x,y.\] First, notice that, regardless of \(\tilde{q}\), it trivially holds for \(y=x\). This is why \(X_{i+1}=x_i\) is unimportant. Next, for \(y\neq x\), \[\begin{align} f(x)\tilde{q}(y|x) &= f(x)q(y|x)\alpha(x,y)\\ &= f(x)q(y|x)\min\left(1, \frac{f(y)q(x|y)}{f(x)q(y|x)}\right)\\ &= \min\left(f(x)q(y|x),\; f(y)q(x|y)\right)\\ &= \min\left(f(y)q(x|y),\; f(x)q(y|x)\right)\\ &= f(y)\tilde{q}(x|y). \end{align}\]
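The algebra above can also be checked numerically on a finite state space (with a made-up target and proposal): build \(\tilde{q}\) from a symmetric proposal and verify that \(\operatorname{diag}(f)\tilde{Q}\) is symmetric.

```r
# Metropolis transition on 4 states: uniform (symmetric) proposal,
# acceptance min(1, f(k)/f(j)), leftover mass placed on the diagonal.
ftar <- c(0.1, 0.2, 0.3, 0.4)  # made-up target distribution
Qprop <- matrix(1/4, 4, 4)     # symmetric proposal
A <- outer(ftar, ftar, function(fj, fk) pmin(1, fk/fj)) # A[j,k] = alpha(j,k)
Qt <- Qprop * A                           # accepted moves j -> k
diag(Qt) <- diag(Qt) + (1 - rowSums(Qt))  # probability of not moving
M2 <- diag(ftar) %*% Qt
all.equal(M2, t(M2))               # detailed balance w.r.t. ftar: TRUE
all.equal(drop(ftar %*% Qt), ftar) # hence ftar is stationary: TRUE
```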
Gibbs sampling (a.k.a. alternating conditional sampling) is a special case of the Metropolis-Hastings algorithm and is widely used to address multidimensional problems. It is particularly suitable for Bayesian hierarchical models due to their modularised model construction. Suppose \(x\) is a vector in a multidimensional space, divided into \(d\) sub-vectors \(x=(x^{(1)},\dots,x^{(d)})\). (n.b. Each sub-vector need not be one-dimensional.) Let \(f^{(k)}\) be the conditional PDF of \(x^{(k)}\) \((k\in\{1,\dots,d\})\) given the values of the other sub-vectors: \[f^{(k)}(x^{(k)}|x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)}) .\] The Gibbs sampler assumes we can sample from each of these conditional PDFs, and proceeds as follows.
You get the idea. \(x_{i+1} = (x_{i+1}^{(1)},\dots,x_{i+1}^{(d)})\) is technically a proposal in the context of the Metropolis-Hastings algorithm, but \(x_{i+1}\) is always accepted in Gibbs sampler.
Go back to the demos above and see whether the code makes sense.
It is straightforward to recognise \(f\) as stationary under Gibbs sampling. Consider sampling \(x^{(k)}\) using \[f^{(k)}(x^{(k)}|x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)}) .\] First, \(f^{(-k)}(x^{(-k)})\) where \(x^{(-k)} = (x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)})\) remains invariant because no sampling occurs for \(x^{(-k)}\). Next, by definition, Gibbs sampler draws a sample of \(x^{(k)}\) from the correct conditional distribution \(f^{(k)}(x^{(k)}|x^{(-k)})\). Thus, the joint PDF \[f(x) = f^{(k)}(x^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})\] remains invariant under Gibbs sampling.
Looking at a sample \(x_{i+1}^{(k)}\) drawn from \[f^{(k)}(x_{i+1}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)}),\] we may put \(x_{i+1}^{(k)}\) in a full-length vector \(x\) and index it as \(j+1\). That is, \[\begin{align} \vdots&\\ x_{j} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ x_{j+1} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i+1}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ \vdots& \end{align}\]
Note that the transition densities of \(x_j\to x_{j+1}\) and \(x_{j+1}\to x_j\) are both given by the conditional PDF for \(x^{(k)}\), \[\begin{align} q(x_{j+1}|x_j) &= f^{(k)}(x_{i+1}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ &= f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})\\ q(x_j|x_{j+1}) &= f^{(k)}(x_{i}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ &= f^{(k)}(x_{i}^{(k)}|x^{(-k)}), \end{align}\] where we use \(x^{(-k)} = x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)}\). Now, it follows that \[\begin{align} \alpha(x_{j},x_{j+1}) &= \min\left(1,\; \frac{f(x_{j+1})\cdot q(x_{j}|x_{j+1})}{f(x_{j})\cdot q(x_{j+1}|x_{j})}\right)\\ &= \min\left(1,\; \frac{f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})\cdot f^{(k)}(x_{i}^{(k)}|x^{(-k)})}{f^{(k)}(x_{i}^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})\cdot f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})}\right)\\ &= \min(1, 1)\\ &= 1 \end{align}\] Hence, a proposal is always accepted.
Imagine you try to model the number of customers at each of two bars in Melbourne using Poisson distributions, namely \(\text{pois}(\lambda_1)\) for Bar 1 and \(\text{pois}(\lambda_2)\) for Bar 2. Based on other information, you think \(\lambda_1\) and \(\lambda_2\) are distinct but somehow related. So, you assume both \(\lambda_1\) and \(\lambda_2\) are distributed as \(\text{exp}(\theta)\). Finally, you complete the model by simply assuming \(\theta \sim \text{exp}(1)\).
Suppose you are going to collect \(m_1\) samples from Bar 1 and \(m_2\) samples from Bar 2. Then, your model is:
\[\begin{gather} X_1, X_2, \dots, X_{m_1} \sim \text{pois}(\lambda_1)\\ X_{m_1+1}, X_{m_1+2}, \dots, X_{m_1+m_2} \sim \text{pois}(\lambda_2)\\ \lambda_1, \lambda_2 \sim \text{exp}(\theta)\\ \theta \sim \text{exp}(1) \end{gather}\] (As usual, the observations are independent. In addition, \(\lambda_1\) and \(\lambda_2\) are independent given \(\theta\).)
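For intuition, one can simulate a synthetic dataset from this generative model by ancestral sampling (the sample sizes here are arbitrary, and these are not the data used below):

```r
# Ancestral sampling from the hierarchical model: theta, then lambdas, then counts.
set.seed(1)
m1 <- 9; m2 <- 18              # arbitrary sample sizes
theta  <- rexp(1, 1)           # theta ~ exp(1)
lambda <- rexp(2, theta)       # lambda_1, lambda_2 ~ exp(theta), independent given theta
x1_sim <- rpois(m1, lambda[1]) # Bar 1 counts
x2_sim <- rpois(m2, lambda[2]) # Bar 2 counts
```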
You naturally expect to learn something about the average numbers of customers at Bar 1 and Bar 2 (\(\lambda_1\) and \(\lambda_2\)) and how they are related.
Given the collected data \[\mathbf{x}=(\mathbf{x}_1,\mathbf{x}_2)=(x_1,\dots,x_{m_1},x_{m_1+1},\dots,x_{m_1+m_2}),\] you derive \(f(\lambda_1, \lambda_2, \theta|\mathbf{x})\), a joint PDF of \((\lambda_1, \lambda_2, \theta)\) conditional on \(\mathbf{x}\).
\[\begin{align} f(\lambda_1,\lambda_2,\theta|\mathbf{x}) &= \frac{f(\mathbf{x},\lambda_1,\lambda_2,\theta)}{f(\mathbf{x})}\\ &\propto f(\mathbf{x},\lambda_1,\lambda_2,\theta)\\ &= f(\mathbf{x}|\lambda_1,\lambda_2,\theta)\cdot f(\lambda_1,\lambda_2,\theta)\\ &= f((\mathbf{x}_1,\mathbf{x}_2)|\lambda_1,\lambda_2)\cdot f(\lambda_1,\lambda_2|\theta)\cdot f(\theta)\\ &= f(\mathbf{x}_1|\lambda_1)\cdot f(\mathbf{x}_2|\lambda_2)\cdot f(\lambda_1|\theta)\cdot f(\lambda_2|\theta)\cdot f(\theta)\\ &= \prod_{i=1}^{m_1}f(x_i|\lambda_1) \cdot \prod_{j=1}^{m_2}f(x_j|\lambda_2) \cdot f(\lambda_1|\theta) \cdot f(\lambda_2|\theta) \cdot f(\theta)\\ &= \prod_{i=1}^{m_1}\frac{e^{-\lambda_1}\lambda_1^{x_i}}{x_i!}\cdot\prod_{j=1}^{m_2}\frac{e^{-\lambda_2}\lambda_2^{x_j}}{x_j!}\cdot\theta e^{-\theta\lambda_1}\cdot\theta e^{-\theta\lambda_2}\cdot e^{-\theta}\\ &\propto e^{-m_1\lambda_1}\lambda_1^{\sum_{i=1}^{m_1}x_i}\cdot e^{-m_2\lambda_2}\lambda_2^{\sum_{j=1}^{m_2}x_j}\cdot\theta e^{-\theta\lambda_1}\cdot\theta e^{-\theta\lambda_2}\cdot e^{-\theta}\\ &= \theta^2 \cdot e^{-\theta} \cdot e^{-(\theta+m_1)\lambda_1}\cdot e^{-(\theta+m_2)\lambda_2} \cdot \lambda_1^{\sum_{i=1}^{m_1}x_i} \cdot \lambda_2^{\sum_{j=1}^{m_2}x_j} \end{align}\]
\(f(\lambda_1, \lambda_2, \theta|\mathbf{x})\) does not look like any familiar PDF.
Let’s implement a Gibbs sampler to sample from this joint PDF. First, we need to identify the following full conditionals from the joint PDF above (perhaps by purposefully organising variables and eyeballing).
\[\begin{align} f(\lambda_1|\lambda_2,\theta,\mathbf{x}) &\propto \lambda_1^{\sum_{i=1}^{m_1}x_i} e^{-(\theta+m_1)\lambda_1} \sim \text{gamma}\left(1+\sum_{i=1}^{m_1}x_i, \theta+m_1\right)\\ f(\lambda_2|\lambda_1,\theta,\mathbf{x}) &\propto \lambda_2^{\sum_{j=1}^{m_2}x_j} e^{-(\theta+m_2)\lambda_2} \sim \text{gamma}\left(1+\sum_{j=1}^{m_2}x_j, \theta+m_2\right)\\ f(\theta|\lambda_1,\lambda_2,\mathbf{x}) &\propto \theta^2 e^{-(1+\lambda_1+\lambda_2)\theta} \sim \text{gamma}(3, 1+\lambda_1+\lambda_2). \end{align}\]
Then, you cycle through them for \(n\) rounds of sampling and return an \(n\times 3\) matrix.
sampler <- function(n, x1, x2) {
  set.seed(90045)
  m1 <- length(x1)
  m2 <- length(x2)
  s1 <- sum(x1)
  s2 <- sum(x2)
  D <- matrix(0, n, 3) # columns: lambda1, lambda2, theta
  for (i in 2:n) {
    D[i,1] <- rgamma(1, 1+s1, m1+D[i-1,3])
    D[i,2] <- rgamma(1, 1+s2, m2+D[i-1,3])
    D[i,3] <- rgamma(1, 3, 1+D[i,1]+D[i,2])
  }
  return(D)
}

For example,
x1 <- c(10,9,6,19,8,13,15,7,11)
x2 <- c(14,10,11,9,9,6,13,19,10,8,6,13,8,15,21,7,12,11)
n <- 1e6
D <- sampler(n, x1, x2)
burn_in <- floor(0.05*n) # throwing away 1st 5% of samples
D <- D[(burn_in+1):n,]

Using the samples D, you can approximate many quantities. For example,
\[\begin{align} \mu_1 &= \mathbb{E}[\lambda_1|\mathbf{x}]\\ \mu_2 &= \mathbb{E}[\lambda_2|\mathbf{x}]\\ \sigma^2_1 &= \mathbb{E}[(\lambda_1-\mu_1)^2|\mathbf{x}]\\ \sigma^2_2 &= \mathbb{E}[(\lambda_2-\mu_2)^2|\mathbf{x}]\\ \text{Cov}(\lambda_1,\lambda_2) &= \mathbb{E}[(\lambda_1-\mu_1)(\lambda_2-\mu_2)|\mathbf{x}]\\ \rho &= \frac{\text{Cov}(\lambda_1,\lambda_2)}{\sigma_1\sigma_2} \end{align}\]
MC estimates are
mu1 <- mean(D[,1])
mu2 <- mean(D[,2])
sig1 <- mean((D[,1]-mu1)^2)
sig2 <- mean((D[,2]-mu2)^2)
cov <- mean((D[,1]-mu1)*(D[,2]-mu2))
rho <- cov/sqrt(sig1*sig2)

\[\hat{\mu}_1 = 10.84,\; \hat{\mu}_2 = 11.2,\; \hat{\rho} = 0.0047.\] It seems the two bars have very similar average numbers of customers. As far as the correlation is concerned, however, they appear unrelated.